Appendix F — Assignment F

Data wrangling

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Do not write your name on the assignment.

  3. Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

  4. Use Quarto to print the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  5. The assignment is worth 100 points, and is due on 19th November 2023 at 11:59 pm.

  6. You are not allowed to use a for loop in this assignment.

  7. Five points are properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (2 pts).
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
  • Final answers of each question are written in Markdown cells (1 pt).
  • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

F.1 Fifa world cup

Read FIFA world cup attendance data from the page: https://en.wikipedia.org/wiki/FIFA_World_Cup . Use ‘attendance’ as the matching string to find the table.

F.1.1

Find the number of levels of column labels and row labels in the data.

(2 points)

F.1.2

Reduce the multiple levels of column labels to a single level as follows. If the column names at all the levels are different, then concatenate the names together. Otherwise, keep the name at the outer level. For example, if the column name is (‘Hosts’,‘Hosts’), it should change to ‘Hosts’. If the column name is (‘Highest attendances †’,‘Number’), it should change to ‘Highest attendances †Number’. Do not rename each column manually. Use a method that will work efficiently if there were a large number of columns, say \(10,000\) columns.

Reminder: Do not use a for loop.

(10 points)

F.2 GDP per capita and population

Read the GDP per capita data from https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita

F.2.1 Preparing GDP per capita data

F.2.1.1

Drop all the Year columns. Use the drop() method with the columns, level and inplace arguments. Print the first 2 rows of the updated DataFrame. If the first row of the DataFrame has missing values for all columns, drop that row as well.

(4 points)

F.2.1.2

Drop the inner level of column labels. Use the droplevel() method. Print the first 2 rows of the updated DataFrame.

(4 points)

F.2.1.3

Convert the columns consisting of GDP per capita by IMF, World Bank, and the United Nations to numeric. Apply a lambda function on these columns to convert them to numeric. Print the number of missing values in each column of the updated DataFrame.

Note: Do not apply the function 3 times. Apply it once on a DataFrame consisting of these 3 columns.

(4 points)

F.2.1.4

Apply the lambda function below on all the column names of the dataset obtained in the previous question to clean the column names.

import re

column_name_cleaner = lambda x:re.split(r'\[|/', x)[0]

Note: You will need to edit the parameter of the function, i.e., x in the above function to make sure it is applied on column names and not columns.

Print the first 2 rows of the updated DataFrame.

(5 points)

F.2.1.5

Create a new column GDP_per_capita that copies the GDP per capita values of the United Nations. If the GDP per capita is missing in the United Nations column, then copy it from the World Bank column. If the GDP per capita is missing both in the United Nations and the World Bank columns, then copy it from the IMF column.

Print the number of missing values in the GDP_per_capita column.

(6 points)

F.2.1.6

Drop all the columns except Country and GDP_per_capita. Print the first 2 rows of the updated DataFrame.

(2 points)

F.2.1.7

The country names contain some special characters (characters other than letters) and need to be cleaned. The following function can help clean country names:

import re

country_names_clean_gdp_data = lambda x: re.sub(r'[^\w\s]', '', x).strip()

Apply the above lambda function on the country column to clean country names. Save the cleaned dataset as gdp_per_capita_data. Print the first 2 rows of the updated DataFrame.

(3 points)

F.2.2 Preparing population data

F.2.2.1

Read the population data from https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations). Drop all columns except Country / Area and Population (1 July 2023).

(2 points)

F.2.2.2

Apply the lambda function below on all the column names of the dataset obtained in the previous question to clean the column names.

import re

column_name_cleaner = lambda x:re.split(r'\[|/|\(| ', x.name)[0]

Note: You will need to edit the parameter of the function, i.e., x in the above function to make sure it is applied on column names and not columns.

Print the first 2 rows of the updated DataFrame.

(5 points)

F.2.2.3

The country names contain some special characters (characters other than letters) and need to be cleaned. The following function can help clean country names:

import re

country_names_clean_population_data = lambda x: re.sub("[\(\[].*?[\)\]]", "", x).strip()

Apply the above lambda function on the country column to clean country names. Save the cleaned dataset as population_data.

(3 points)

F.2.3 Merging GDP per capita and population datasets

F.2.3.1

Merge gdp_per_capita_data with population_data to get the population and GDP per capita of countries in a single dataset. Print the first two rows of the merged DataFrame.

Assume that:

  1. We want to keep the GDP per capita of all countries in the merged dataset, even if their population in unavailable in the population dataset. For countries whose population in unavailable, their Population column will show NA.

  2. We want to discard an observation of a country if its GDP per capita is unavailable.

(4 points)

F.2.3.2

For how many countries in gdp_per_capita_data does the population seem to be unavailable in population_data? Note that you don’t need to clean country names any further than cleaned by the functions provided.

Print the observations of gdp_per_capita_data with missing Population.

(3 points)

F.2.4 Merging datasets with similar values in the key column

We suspect that population of more countries may be available in population_data. However, due to unclean country names, the observations could not merge. For example, the country Guinea Bissau is mentioned as GuineaBissau in gdp_per_capita_data and Guinea-Bissau in population_data. To resolve this issue, we’ll use a different approach to merge datasts. We’ll merge the population of a country to an observation in the GDP per capita dataset, whose name in population_data is the most ‘similar’ to the name of the country in gdp_per_capita_data.

F.2.4.1

Proceed as follows:

  1. For each country in gdp_per_capita_data, find the country with the most ‘similar’ name in population_data, based on the similarity score. Use the lambda function provided below to compute the similarity score between two strings (The higher the score, the more similar are the strings. The similarity score is \(1.0\) if two strings are exactly the same).

  2. Merge the population of the most ‘similar’ country to the country in gdp_per_capita_data. The merged dataset must include 5 columns - the country name as it appears in gdp_per_capita_data, the GDP per capita, the country name of the most ‘similar’ country as it appears in population_data, the population of that country, and the similarity score between the country names.

  3. After creating the merged dataset, print the rows of the dataset that have similarity scores less than 1.

Use the function below to compute the similarity score between the Country names of the two datasets:

from difflib import SequenceMatcher

similar = lambda a,b: SequenceMatcher(None, a, b).ratio()

Note: You may use one for loop only for this particular question. However, if don’t use a for loop, you will get 10 bonus points.

(18 points)

Hint:

  1. Define a function that computes the index of the observation having the most ‘similar’ country name in population_data for an observation in gdp_per_capita_data. The function returns a Series consisting of the most ‘similar’ country name, its population, and its similarity score (This function can be written with only one line in its body, excluding the return statement and the definition statement. However, you may use as many lines as you wish).

  2. Apply the function on the Country column of gdp_per_capita_data. A DataFrame will be obtained.

  3. Concatenate the DataFrame obtained in (2) with gdp_per_capita_data with the pandas concat() function.

F.2.4.2

In the dataset obtained in the previous question, for all observations where similarity score is less than 0.8, replace the population with Nan.

Print the observations of the dataset having missing values of population.

(2 points)

F.3 GDP, surplus, and compensation

The dataset Real GDP.csv contains the GDP of each US State for all years starting from 1997 until 2020. The data is at State level, i.e., each observation corresponds to a unique State.

The dataset Surplus.csv contains the surplus of each US State for all years starting from 1997 until 2020. The data is at year level, i.e., each observation corresponds to a unique year.

The dataset Compensation.csv contains Compensation and Chain-type quantity indexes for real GDP for each US State and year starting from 1997 to 2020. The dataset is at Year-State-Description level, i.e., each observation corresponds to a unique Year-State-Description combination where Description refers to either Compensation or Chain-type quantity indexes for real GDP.

F.3.1 Combining datasets

Combine all these datasets to obtain a dataset at State-Year level, i.e., each observation corresponds to a unique State-Year combination. The combined dataset must contain the GDP, surplus, Compensation, and Chain-type quantity indexes for real GDP for each US State and all years starting from 1997 until 2020. Note that each observation must contain the name of the US State, year, and the four values (GDP, surplus, compensation, and Chain-type quantity indexes for real GDP).

Hint: Here is one way to do it:

  1. Melt the GDP dataset to year-State level

  2. Melt the Surplus dataset to year-State level

  3. Pivot the compensation dataset to year-State level

  4. Now that all the datasets are at the year-State level, merge them!

(3 + 3 + 3 + 1 = 10 points)

F.3.2 Time trend: GDP, surplus, and compensation

Use a single plot to answer all three questions below by visualizing:

  1. How does the mean GDP (mean over all States) change with year? (1 point for visualization)

  2. How does the mean compensation (mean over all States) change with year? (1 point for visualization)

  3. How does the mean surplus (mean over all States) change with year? (1 point for visualization)

Also show the 95% confidence interval for the mean GDP, mean compensation, and mean surplus in the plot.

Hint: Use the seaborn function lineplot() . No calculations are needed. Just use lineplot() three times.

(4 points)

F.3.3 Time trend: GDP with region

Merge the file State_region_mapping.csv with the dataset obtained in the previous question. Make a lineplot showing the mean GDP for each of the five regions with year. Do not display the confidence interval. Which two regions seems to have the least growth in GDP over the past 24 years?

(4 points)